Skip to content

roachtest: capture verbose SSH logging on failures#92667

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_capture_verbose_ssh
Jan 18, 2023
Merged

roachtest: capture verbose SSH logging on failures#92667
craig[bot] merged 1 commit intocockroachdb:masterfrom
smg260:roachtest_capture_verbose_ssh

Conversation

@smg260
Copy link
Copy Markdown
Contributor

@smg260 smg260 commented Nov 29, 2022

This change captures verbose SSH log output to the same directory as other test logs. The output file is deleted if no errors were encountered.

Also included in this PR:

  • Thread through retry options via each caller of c.ParallelE thus making retries optional. This will later be exposed via API in roachtests.
  • Fix stack depth in logs for failure reporting
  • Initialise ssh sessions with commands to indicate single use
  • Improved naming of ssh logs (consistent with run logs in the form of ssh|run_timestamp_node_cmd)
  • Optional debug name for ssh commands, to use in debug filenames

Release note: None
Epic: CRDB-21386

@smg260 smg260 force-pushed the roachtest_capture_verbose_ssh branch 3 times, most recently from 6b3ab03 to 7aa1aee Compare November 30, 2022 14:03
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

format = strings.Repeat(" %v", len(args))[1:]
}
reportFailure := newFailure(errors.NewWithDepthf(1, format, args...), collectErrors(args))
reportFailure := newFailure(errors.NewWithDepthf(2, format, args...), collectErrors(args))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious why 2 is better than 1. Presumably, this is so that runtime.Callers will skip this frame (and the ones below it)? It might be worth explaining this in a comment; personally, I find the depth arithmetic error-prone.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The depth here tells us how many of the most recent frames to ignore. That should give us the, in the logs, where t.Fatalf was called from (which can be from roachtest internal, or an actual roachtest).

Previously, we were creating this error from t.Fatalf, and the depth was 1. During more recent refactoring, the error creation was moved to this shared addFailure, which is 1 more call away, but the depth was not incremented to account for that.

@smg260 smg260 force-pushed the roachtest_capture_verbose_ssh branch 6 times, most recently from 9decf06 to 58a92bd Compare December 9, 2022 02:44
@smg260 smg260 marked this pull request as ready for review December 9, 2022 15:12
@smg260 smg260 requested a review from a team as a code owner December 9, 2022 15:12
@smg260 smg260 requested review from herkolategan and srosenberg and removed request for a team December 9, 2022 15:12
Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)


pkg/cmd/roachtest/test_impl.go line 336 at r1 (raw file):

Previously, smg260 (Miral Gadani) wrote…

The depth here tells us how many of the most recent frames to ignore. That should give us the, in the logs, where t.Fatalf was called from (which can be from roachtest internal, or an actual roachtest).

Previously, we were creating this error from t.Fatalf, and the depth was 1. During more recent refactoring, the error creation was moved to this shared addFailure, which is 1 more call away, but the depth was not incremented to account for that.

Right. It might still be worth a single-line comment since in majority of the case you'd expect to see 1. Btw, just saw the same fix was also applied in this PR: https://github.com/cockroachdb/cockroach/pull/94361/files#diff-c0e178682162fce5de4707f6c5cd4fb00efecb1b57f9b7d3922b5938877b719cL336-R336

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! Are there any open items? I noticed one TBD that may have been already resolved? Meanwhile, I'd like to take a look at the CI run to see how the ssh debug looks with the rest of the logs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)


pkg/roachprod/install/cluster_synced.go line 145 at r2 (raw file):

// number and a wrapper error.
func runWithMaybeRetry(
	l *logger.Logger, runRetryOpts *RunRetryOpts, f func() (*RunResultDetails, error),

Nit: runRetryOpts seems redundant; I actually liked the previous name; curious why you thought it needed to be more explicit.


pkg/roachprod/install/session.go line 51 at r2 (raw file):

}

// TODO: MG investigate - ssh log is not produced for the 3rd and final attempt

Still unresolved?


pkg/roachprod/install/session.go line 55 at r2 (raw file):

	var logfile string
	var loggingArgs []string
	cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339)))

Since these will be appended to l.File, maybe prefix with something like ssh_debug to make it easy to filter; i.e., grep -v ssh_debug_?

Also, I wonder if interleaving could pose a problem; it might be better to redirect all ssh debug to a separate file.


pkg/roachprod/install/session.go line 56 at r2 (raw file):

	var loggingArgs []string
	cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339)))
	// running roachprod from the cli will result in a fileless logger

Nit: comment is a not a full sentence :)


pkg/roachprod/install/session.go line 57 at r2 (raw file):

	cl, err := l.ChildLogger(fmt.Sprintf("ssh_%s_%s", host, timeutil.Now().Format(time.RFC3339)))
	// running roachprod from the cli will result in a fileless logger
	if err == nil && l.File != nil {

Might be worth adding a debug flag to disable verbose (ssh) logging (quickly) in case we run into an unwanted side-effect.

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 145 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Nit: runRetryOpts seems redundant; I actually liked the previous name; curious why you thought it needed to be more explicit.

Agreed, and I can't remember why - may have happened as part of a rename/refactor by the ide.


pkg/roachprod/install/session.go line 51 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Still unresolved?

Seems like you may have looked at an older revision in reviewable - that TODO is gone.


pkg/roachprod/install/session.go line 55 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Since these will be appended to l.File, maybe prefix with something like ssh_debug to make it easy to filter; i.e., grep -v ssh_debug_?

Also, I wonder if interleaving could pose a problem; it might be better to redirect all ssh debug to a separate file.

They are currently both prefixed with ssh_ and stored in a separate ssh/ directory next to the other logs. The ssh log names are also encoded in a similar manner to the run logs, which includes a nano precision ts, node number and command.

Here is a nightly run with ssh debugging for failures:

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/7901775?buildTab=overview&showRootCauses=true&expandBuildChangesSection=true&expandBuildTestsSection=true#%2Fbackup;%2Fimport%2Ftpcc%2Fwarehouses%3D4000%2Fgeo%2Frun_1%2Fartifacts.zip!%2Fssh


pkg/roachprod/install/session.go line 57 at r2 (raw file):

Previously, srosenberg (Stan Rosenberg) wrote…

Might be worth adding a debug flag to disable verbose (ssh) logging (quickly) in case we run into an unwanted side-effect.

Sure we can do that - as it stands, we are only preserving the logs for SSH commands that return a non-zero exit code. Perhaps the flag can be used to specify an ssh debug mode None, Failed, All?

@smg260 smg260 force-pushed the roachtest_capture_verbose_ssh branch from 58a92bd to 488306e Compare December 29, 2022 18:10
Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2, 1 of 4 files at r6, 3 of 5 files at r9, 3 of 3 files at r10, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)


pkg/roachprod/install/session.go line 55 at r2 (raw file):

They are currently both prefixed with ssh_ and stored in a separate ssh/ directory next to the other logs.

Nice, I didn't realize they were being stored in a separate dir. One downside of using ssh debug is the lack of timestamps. E.g., looking at [1], we mostly care about commands executed, the rest is mostly noise,

debug1: Sending command: export ROACHPROD=1 GOTRACEBACK=crash && bash -c "cd /mnt/data1/activerecord-cockroachdb-adapter/ && sudo RUBYOPT=\\"-W0\\" TESTOPTS=\\"-v\\" bundle exec rake test"

This reminded me of a bastion that my former colleague open-sourced. He is using a tty-recorder [2] which timestamps all commands and as a bonus makes it easy to replay the whole session. The setup is pretty simple [3]. I wonder if we should incorporate it instead of ssh debug?

[1] https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_RoachtestNightlyGceBazel/7901775:id/activerecord/run_1/artifacts.zip!/ssh/ssh_145107.952226743_n1_cd-mntdata1activerec.log
[2] https://pkg.go.dev/maze.io/x/ttyrec
[3] https://github.com/Inpher/sb/blob/main/cmd/ttyrec.go#L113-L137

@smg260 smg260 force-pushed the roachtest_capture_verbose_ssh branch from 488306e to 49be9d7 Compare January 11, 2023 18:32
@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Jan 11, 2023

Adding timestamps to SSH debug statements in stderr works but relies on redirecting to a file, thus remove stderr from the CombinedOutput which may be used by callers for string matching.

This PR's use of -E redirects SSH debug logging to a specified file while preserving stderr generated by the encapsulated command.

Will need to revisit adding timestamps at a later stage

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will need to revisit adding timestamps at a later stage

Any ideas on how we'd recover timestamps without refactoring CombinedOutput?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)

@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Jan 13, 2023

Nothing simple comes to mind since the only solutions involve omitting -E and having debug messages printed to stderr, interleaving them with the actual remote command's stderr.

We could write a timestamp at the beginning and end of a command. Not ideal, but better than nothing?

Copy link
Copy Markdown
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @smg260)

Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 132 at r11 (raw file):

// defaultSCPRetry assumes any error is retryable
var defaultSCPRetry = newRunRetryOpts(defaultRetryOpt, nil)

Nit: a more readable version of this would be to define a alwaysRetry function and pass that here. nil doesn't have an obvious interpretation, meaning you have to resort to comments (which can get outdated), or reading the code of the retrying function to understand the semantics.


pkg/roachprod/install/cluster_synced.go line 147 at r11 (raw file):

	l *logger.Logger, retryOpts *RunRetryOpts, f func() (*RunResultDetails, error),
) (*RunResultDetails, error) {
	if retryOpts == nil {

Nit: could we have a default config when none is passed? We'd avoid the special case and it would make the errors more uniform.


pkg/roachprod/install/session.go line 63 at r11 (raw file):

	if command.debugName == "" {
		command.debugName = GenFilenameFromArgs(20, strings.Fields(command.cmd)...)

Nit: we don't have to use strings.Fields()..., do we?


pkg/roachprod/install/session.go line 78 at r11 (raw file):

		logfile = cl.File.Name()
		loggingArgs = []string{
			// Extra verbosity is unlikely to give us meaningful information when commands fail.

I don't understand: why is it unlikely? Also, why only -v and not -vvv (as the commit message says)?


pkg/roachprod/install/session.go line 86 at r11 (raw file):

		loggingArgs = []string{"-q"}
	}
	//const logfile = ""

🔥


pkg/roachprod/install/session.go line 112 at r11 (raw file):

func (s *remoteSession) errWithDebug(err error) error {
	if err != nil && s.logfile != "" {
		err = errors.Wrapf(err, "ssh verbose log retained in %s", filepath.Base(s.logfile))

Wouldn't this remove the ssh/ directory?

Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)

a discussion (no related file):
Looking forward to this! 🎉


@renatolabs renatolabs dismissed their stale review January 17, 2023 20:08

clicked by accident

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/cluster_synced.go line 132 at r11 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: a more readable version of this would be to define a alwaysRetry function and pass that here. nil doesn't have an obvious interpretation, meaning you have to resort to comments (which can get outdated), or reading the code of the retrying function to understand the semantics.

Good idea


pkg/roachprod/install/session.go line 63 at r11 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Nit: we don't have to use strings.Fields()..., do we?

Good catch - we don't. A previous iteration of GenFilenameFromArgs did need it.


pkg/roachprod/install/session.go line 78 at r11 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I don't understand: why is it unlikely? Also, why only -v and not -vvv (as the commit message says)?

I will change back to vvv - since we only keep erroneous command logs. I find the extra verbosity obfuscates any useful information. This will result in client/server/extra debugging.

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/session.go line 112 at r11 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Wouldn't this remove the ssh/ directory?

This just ensures that when we record an error, we don't delete the file on Close()

if s.logfile != "" {
		_ = os.Remove(s.logfile)
	}

Currently we do not capture SSH logs in the event of a
command failure, which can be useful in debugging issues,
transient or otherwise.

This commit enables logging via the ssh switch -vvv and
specifying a log filename, to be stored under an ssh/
directory in the test log root. The debug file is deleted
upon successful (zero) exit of the command, but preserved
for non-zero exits for further inspection.

Additionally,
- The name of the log is consistent with the
corresponding run log and encodes a node number and
timestamp.
- SSH sessions must now be initialised with the
command itself to re-inforce its single use nature.
- Debug friendly command names can optionally be
specified to influence the name of the run/ssh logs.
- Retry options can optionally be omitted from any call
to ParallelE to disable retries

Release note: None
Epic: CRDB-21386
@smg260 smg260 force-pushed the roachtest_capture_verbose_ssh branch from 49be9d7 to 1de6c82 Compare January 18, 2023 03:00
Copy link
Copy Markdown

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/roachprod/install/session.go line 112 at r11 (raw file):

Previously, smg260 (Miral Gadani) wrote…

This just ensures that when we record an error, we don't delete the file on Close()

if s.logfile != "" {
		_ = os.Remove(s.logfile)
	}

I meant the filepath.Base part -- the filename in the logs will omit the ssh/ directory which could be confusing.

Copy link
Copy Markdown
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/roachprod/install/session.go line 112 at r11 (raw file):

Previously, renatolabs (Renato Costa) wrote…

I meant the filepath.Base part -- the filename in the logs will omit the ssh/ directory which could be confusing.

Ah, it will but I'd think that the ssh directory itself would be enough of a clue. Will prefix with ssh in a future PR if it there is any confusion.

@smg260
Copy link
Copy Markdown
Contributor Author

smg260 commented Jan 18, 2023

TFTR

bors r=srosenberg

@craig
Copy link
Copy Markdown
Contributor

craig Bot commented Jan 18, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants